If our data is not completely linearly separable because of just a few points, but a linear classifier would still generally do a good job, we can use Soft Margin SVMs. These are a type of Support Vector Machine that allows for a certain amount of error during training.
In contrast to Hard Margin SVMs, the constraint \(y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b)\ge 1\) becomes \(y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b)\ge 1 - \zeta_i\), with \(\zeta_i \ge 0\). The slack variable \(\zeta_i\) measures by how much the Hard Margin SVM constraint is being violated; at the optimum it equals the Hinge Loss \(\max(0,\, 1 - y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b))\). The Soft Margin SVM then tries to maximise the width of the margin whilst also minimising the average value of \(\zeta_i\). The trade-off between the \(\zeta_i\) terms and the margin width is controlled by a parameter known as \(\lambda\).
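To make the slack concrete, here is a minimal NumPy sketch that computes \(\zeta_i\) for a candidate hyperplane. The toy points, labels, and the values of \(\mathbf{w}\) and \(b\) are made up purely for illustration:

```python
import numpy as np

# Toy 2-D data with +1 / -1 labels (illustrative values only)
X = np.array([[2.0, 3.0], [1.0, -1.0], [-2.0, -1.5], [-0.5, 0.5]])
y = np.array([1, 1, -1, -1])

# An arbitrary candidate hyperplane <w, x> + b
w = np.array([1.0, 0.5])
b = -0.25

# Margin scores y_i * (<w, x_i> + b); the hard-margin constraint demands >= 1
scores = y * (X @ w + b)

# Slack zeta_i = max(0, 1 - score): by how much each point violates the constraint
zeta = np.maximum(0.0, 1.0 - scores)
print(zeta)          # zero for points that already satisfy the hard-margin constraint
print(zeta.mean())   # the average violation that the Soft Margin SVM penalises
```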
The \( \lambda \) parameter therefore trades off having a large margin against correctly classifying the training data: you get better classification of the training data at the expense of a wide margin. By choosing a higher value of \( \lambda \) you imply that you want fewer errors on the training data.
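You can see this trade-off numerically with scikit-learn. Its `SVC` regularisation parameter `C` plays the role of \( \lambda \) here (a larger value punishes violations more, giving fewer training errors and a narrower margin); the blobs below are synthetic data invented for the sketch:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two slightly overlapping Gaussian blobs (synthetic data for illustration)
X = np.vstack([rng.normal([2, 2], 1.0, size=(50, 2)),
               rng.normal([-2, -2], 1.0, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

# Small C tolerates margin violations (wide margin); large C punishes them.
for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin_width = 2.0 / np.linalg.norm(clf.coef_[0])  # width = 2 / ||w||
    train_acc = clf.score(X, y)
    print(f"C={C:>6}: margin width = {margin_width:.2f}, "
          f"training accuracy = {train_acc:.2f}")
```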
The choice of \( \lambda \) can be a complicated task in itself and sometimes requires dedicated algorithms of its own. Have a look at the plot below to see how adjusting the size of \( \lambda \) affects the margin width.
Click to add an orange point.
Shift + Click to add a blue point.
Compared to the Hard Margin SVM, it is no longer possible to have both a good separating boundary and a large, point-free margin. This means that some points creep into the margin.
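You can count those points directly: the relaxed constraint tells us a point sits inside the margin (or on the wrong side) exactly when \( y_i(\langle\mathbf{w},\mathbf{x}_i\rangle+b) < 1 \). A short sketch on the same kind of synthetic data as above:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Overlapping synthetic blobs, as in the previous sketch
X = np.vstack([rng.normal([2, 2], 1.0, size=(50, 2)),
               rng.normal([-2, -2], 1.0, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# A point has crept into the margin (or beyond) when y_i * f(x_i) < 1
inside_margin = y * clf.decision_function(X) < 1
print(f"{inside_margin.sum()} of {len(X)} points lie inside the margin")
```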
Sometimes data is nowhere near linearly separable.
We aren't always lucky enough to have only a few overlapping points.
However, all is not lost: Support Vector Machines can deal with non-linearly separable data too, using a nifty trick!